Water kefir is a fizzy and mildly acidic beverage obtained through the process of fermenting a mixture of water, sugar, fruits, and water kefir grains. This particular drink has become more well-known as a result of its alleged probiotic qualities, which are thought to be advantageous to health. Additionally, water kefir is considered a potential vegan alternative to milk kefir. In order to gain a better understanding of the biological aspects and the mechanisms involved in water kefir production, several studies have aimed to identify the microbial species present in water kefir grains. It has been observed that the microbial communities in water kefir grains exhibit variations depending on factors such as the geographical origin of the grains and the specific ingredients used in the fermentation process. However, the majority of these communities consist of yeast, lactic acid bacteria, and acetic acid bacteria. Lactic and acetic acid bacteria are widely distributed groups of bacteria that possess the ability to metabolize sugars into acids. Consequently, these bacteria are commonly found in various types of fermented foods. Previous researches employing culture-based techniques have provided evidence for the presence of specific species, namely Lactobacillus harbinensis, Lactobacillus hilgardii, Lactobacillus paracasei, and Bifidobacterium aquikefiri, within various water kefir cultures. These particular species have demonstrated a high degree of conservation across different kefir cultures, as indicated by previous studies. However, in-depth exploration of the diversity and composition of water kefir has been limited, with only a few studies adopting metagenomic approaches such as shotgun metagenomics. The utilization of this DNA sequencing methodology enables comprehensive genomic analyses of all microorganisms present in a given sample, facilitating a precise determination of microbial species diversity within the water kefir community. Shotgun metagenomic sequencing is a powerful approach in microbial community analysis that integrates high-throughput sequencing technologies with computational pipelines. By sequencing the entire microbial communities present in multiple samples, we were able to investigate the stability and diversity of kefir microbiomes derived from two distinct starter cultures and fermented with either figs or raisins. Our study aimed to examine the variations between the initial microbial communities, their temporal composition changes, and how they were influenced by the surrounding environment.
Our experimental design involved cultivating kefir grains from two distinct starter cultures, namely Philipp starter and Vincent starter, using either raisins or figs as the fermentation substrate. To achieve a comprehensive genetic and metagenetic characterization of the kefir grains, we employed multiple methods in this study. Initially, we conducted whole genome sequencing on cultivable isolates derived from the starter culture kefir grains. This analysis utilized two complementary technologies: Illumina for generating short reads and Oxford Nanopore for producing long reads. These sequencing techniques were selected to ensure the generation of high-quality genome assemblies for our subsequent analyses. To assess the taxonomic diversity of our samples, we performed 16S amplicon sequencing, which was analyzed using the DADA2 workflow. Prior to analysis, we performed trimming and quality control steps using trimmomatic and cutadapt to remove the PCR primers utilized during the amplification process. Subsequently, fastQC was employed to verify the quality of the raw data obtained from the sequencing. The read counts were recorded both before and after trimming to ensure an adequate number of reads remained for downstream analysis.
The entire DADA2 analysis was conducted using R and the DADA2 package. The analysis began with the generation of quality profile plots. Subsequently, another trimming step was performed to retain only high-quality base pairs while ensuring sufficient overlap for Amplicon Sequence Variant (ASV) bridging in later stages. After trimming, the remaining reads were utilized to train an error rate model. Error plots were generated to verify that substitution errors fell within the expected range. Using the error rate model, errors were corrected, allowing for the inference of ASV variants. The corrected ASVs obtained from the previous step were merged to obtain complete paired-end ASV variants with an approximate length of 255 base pairs. These sequences were then used to construct a table of ASVs ranging from 250 to 256 base pairs. Following the removal of potential chimeric ASVs resulting from PCR artifacts, the taxonomic classification of ASVs was performed. The classified ASVs were saved and exported as a Phyloseq object, which was utilized to generate bar plots illustrating sample diversity. Subsequently, we proceeded with the further characterization of our samples using shotgun metagenomic sequencing. Similar to the 16S Amplicon data, initial steps of trimming and quality control were performed.
Why did we use the -g option ?
> We used the “-g” option to tell the algorithm to look for adapters
that precede (5’) the sequence of interest and trim them.
Why is there ‘strange’ nucleotides in the primers (e.g. H, W,
M..).?
> These symbols are IUPAC wildcards chracters and can refer to
multiple bases, in the case of a damaged DNAs. For example, W means an A
or a T.
# Forward and reverse fastq filenames have format: SAMPLENAME_1_paired_cut.fq.gz and SAMPLENAME_2_paired_cut.fq.gz
# PathReads contains the path to the files
FWDfiles <- sort(list.files(PathReads, pattern = "_1_paired_cut.fq.gz", full.names = TRUE))
REVfiles <- sort(list.files(PathReads, pattern = "_2_paired_cut.fq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_*.fq*
sample.names <- sapply(strsplit(basename(FWDfiles), "_"), `[`, 1)
sample.names
## [1] "K1" "K10" "K11" "K12" "K13" "K14" "K15" "K16" "K17" "K18" "K19" "K2"
## [13] "K20" "K21" "K22" "K23" "K24" "K25" "K26" "K27" "K28" "K29" "K3" "K30"
## [25] "K31" "K32" "K33" "K34" "K4" "K5" "K6" "K7" "K8" "K9"
Explain how this command line works:
sapply(strsplit(basename(FWDfiles), "_"),[, 1)
> This command line is used to extract the sample name of each sample
from the corresponding filename (K..). The basename function extracts
the name for each file of the forward files, without the directory path.
The strsplit function is used to split each filename obtained into
multiple parts using the “_” as delimitor. The sapply function applies
the operator “[” and passes the argument “1” to this operator, resulting
in “[1]”, meaning we take the first element of the vector (the K..).
# Quality scores of R1 reads
plotQualityProfile(FWDfiles[5:8])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# Quality scores of R2 reads
plotQualityProfile(REVfiles[5:8])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# Quality scores for aggregated files
plotQualityProfile(FWDfiles[1:18], n = 1e+06, aggregate = TRUE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plotQualityProfile(REVfiles[1:18], n = 1e+06, aggregate = TRUE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
What do you think of these quality profiles?
> Regarding our reads, the quality is good at the beginning but
decreases at the end. Here the green line represents the median quality
score and the red line represents the proportion of reads that reached a
certain length. Looking at the aggregated file, we have a good
proportion of reads that have a sufficient quality score and also many
reads reaching a sufficient length. We will however need to trim the
data before the analysis.
Explain what is the option n = 1e+06?
> The option of n = 1e+06 is used to specify the number of reads to
sample.
# Place filtered files in filtered/ subdirectory
PathTrim <- paste(PathToWD, "02_16SrRNA_data_trimmed_cut_trimmed", sep="/")
filtFWD <- file.path(PathTrim, paste0(sample.names, "_F_paired_cut_trimmed.fq.gz"))
filtREV <- file.path(PathTrim, paste0(sample.names, "_R_paired_cut_trimmed.fq.gz"))
names(filtFWD) <- sample.names
names(filtREV) <- sample.names
out <- filterAndTrim(FWDfiles, filtFWD,
REVfiles, filtREV,
truncLen=c(155,125), # There is around 84bp overlap
maxN=0,
maxEE=c(2,2),
truncQ=9,
rm.phix=TRUE,
compress=TRUE, multithread=FALSE)
# All parameters but truncLen are default DADA2 params
# Check this all makes sense
plotQualityProfile(filtFWD[1:18], n = 1e+06, aggregate=TRUE)
plotQualityProfile(filtREV[1:18], n = 1e+06, aggregate=TRUE)
Figure 1. Summary of the quality profiles of all our samples, as well as the aggregated.
Can you see a difference between these profiles and the ones
before trimming?
> Compared to the quality profiles before trimming, we can see that
the quality scores increased. We retained the majority of the reads with
more than 976’000 reads remaining.
errF <- learnErrors(filtFWD, randomize = TRUE, nbases = 3e8, multithread = TRUE)
## 300455255 total bases in 1938421 reads from 32 samples will be used for learning the error rates.
errR <- learnErrors(filtREV, randomize = TRUE, nbases = 3e8, multithread = TRUE)
## 258476750 total bases in 2067814 reads from 34 samples will be used for learning the error rates.
plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ = TRUE)
Figure 2. Log of the error frequency, as a function of the consensus quality score.
derepFWD <- derepFastq(filtFWD)
derepREV <- derepFastq(filtREV)
sample.names <- sapply(strsplit(basename(filtFWD),"_F_"),`[`,1)
names(derepFWD) <- sample.names
names(derepREV) <- sample.names
dadaFs <- dada(derepFWD, err = errF, multithread = TRUE)
dadaRs <- dada(derepREV, err = errR, multithread = TRUE)
dadaFs$K2
## dada-class: object describing DADA2 denoising results
## 275 sequence variants were inferred from 5130 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
dadaRs$K2
## dada-class: object describing DADA2 denoising results
## 222 sequence variants were inferred from 3405 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
What kind of objects are dadaFs and dadaRs?
> They are dada-class objects generated by denoising the forward
(dadaF) or reverse (dadaR) reads. They refer to the outputs generated by
the DADA2 algorithm during the sequence processing steps. These objects
contain the results of denoising and error correction performed on the
raw sequence data, precisely information such as the sequence variants,
their abundances, and other associated metadata.
How many sequence variants (ASVs) are inferred?
> In the forward, 275 sequence variants were inferred from 5130 input
unique sequences. In the reverse, 222 sequence variants were inferred
from 3405 input unique sequences.
mergers <- mergePairs(dadaFs, derepFWD, dadaRs, derepREV, verbose = TRUE, trimOverhang = TRUE)
# Checking the dataframe
head(mergers[[1]])
## sequence
## 1 TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGTCGTGCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGTTCGAAAGCGTGGGTAGCAAACAGG
## 2 TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAGGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACCGGGAGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG
## 3 TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCTCGGCTTAACCGAGGAAGCGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG
## 4 TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGTCGTGCATTGGAAACTGGAGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGTTCGAAAGCGTGGGTAGCAAACAGG
## 5 TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGG
## 6 TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCTCGGCTTAACCGAGGAAGCGCATCGGAAACTGGAAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 24231 1 1 27 0 0 2 TRUE
## 2 19093 2 2 27 0 0 2 TRUE
## 3 3937 4 3 27 0 0 2 TRUE
## 4 3338 3 1 27 0 0 2 TRUE
## 5 2852 6 4 27 0 0 2 TRUE
## 6 2805 5 3 27 0 0 2 TRUE
What information can you find in the merger object?
> The mergers object is a dataframe summarizing the stats for each
sequence, such as the abundance, the number of matches and mismatches,
etc. Each line is a sequence, and we have information about both the
forward and reverse.
seqtab <- makeSequenceTable(mergers)
# Check distribution
table(nchar(getSequences(seqtab)))
##
## 101 107 137 157 163 168 170 171 178 181 203 221 223 224 225 227
## 1 1 1 1 1 1 2 1 1 1 1 11 3 2 1 5
## 228 229 231 233 234 243 244 250 251 252 253 254 255 256 257 258
## 1 1 1 1 1 1 3 6 10 379 6064 278 12 3 1 2
## 260 263 265 267 268
## 1 1 1 1 1
nrow(seqtab)
## [1] 34
Why do we expect ASVs within the above mentioned length
range?
> We expect ASVs to be in this range, because we used Illumina
sequencing and it is the usual length of the sequences. After merging
the Fwd and Rev reads, we expect sequences in the range of [250:256] bp.
However, some sequences may be shorter or longer than expected due to
wrong merging.
What are the dimension of the seqtab?
> The seqtab has 34 rows (which are the sequences) and 6795 columns
(which are the samples).
What can you say about the distribution of the sequence
lengths?
> As said above, the length is mainly between 252 and 254, with the
most common length being 253. This is because it is the size of our
Illumina sequences.
Explain this command line
table(nchar(getSequences(seqtab)))
> getSequences extracts the unique sequences from our seqtab object.
nchar counts the number of nucleotides for each sequence. The function
table then creates a frequency table from the vector of sequence
lengths.
seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 252:254]
# Check distribution
table(nchar(getSequences(seqtab2)))
##
## 252 253 254
## 379 6064 278
# Dimensions
dim(seqtab2)
## [1] 34 6721
# Difference with seqtab
ncol(seqtab)-ncol(seqtab2) # 82 ASVs have been removed
## [1] 82
What are the dimensions of seqtab2?
> We have still 34 sequences but 6713 samples.
How many ASVs have been removed?
> 82 ASVs have been removed (6795-6713).
seqtab.nochim <- removeBimeraDenovo(seqtab2, method = "consensus", multithread = TRUE, verbose = TRUE)
## Identified 1649 bimeras out of 6721 input sequences.
(1 - (ncol(seqtab.nochim)/ncol(seqtab2))) * 100
## [1] 24.53504
ncol(seqtab.nochim)
## [1] 5072
What is the percentage of chimeric sequences?
> To get the percentage of chimeric sequences, we can use the number
of columns of our files, since every sequence is stored in a different
column. We have 24.5% of chimeric sequences.
How many ASVs do you have in the end?
> We have 5067 ASVs in the end.
taxa <- assignTaxonomy(seqtab.nochim, "/mnt/raidarray/home/SHARED/2023_SAGE2/SILVA/silva_nr99_v138_train_set.fa.gz", multithread=FALSE)
taxa2 <- addSpecies(taxa, "/mnt/raidarray/home/SHARED/2023_SAGE2/SILVA/silva_species_assignment_v138.fa.gz")
# Creating table for display
taxa.print <- taxa
rownames(taxa.print) <- NULL
head(taxa.print)
## Kingdom Phylum Class Order
## [1,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
## [2,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
## [3,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
## [4,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
## [5,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
## [6,] "Bacteria" "Actinobacteriota" "Actinobacteria" "Bifidobacteriales"
## Family Genus
## [1,] "Lactobacillaceae" "Lactobacillus"
## [2,] "Lactobacillaceae" "Lactobacillus"
## [3,] "Lactobacillaceae" "Lactobacillus"
## [4,] "Lactobacillaceae" "Lactobacillus"
## [5,] "Lactobacillaceae" "Lactobacillus"
## [6,] "Bifidobacteriaceae" NA
PathTax <- paste(PathToWD, "03_Taxonomy", sep = "/")
dir.create(PathTax)
## Warning in dir.create(PathTax): '/mnt/raidarray/home/etu19/DADA2/03_Taxonomy'
## existe déjà
# Save the ASV sequences and the Taxonomy table
write.csv2(file = paste(PathTax, "Taxtable_dada2.csv", sep = "/"), taxa2)
write.csv2(file = paste(PathTax, "ASV_sequences.csv", sep = "/"), seqtab.nochim)
# Open the data frame containing sample information
metaData <- read.csv("/mnt/raidarray/home/SHARED/2023_SAGE2/Metadata_kefir.csv", fill = T)
rownames(metaData) = metaData$id_sample
#Create a phyloseq object
ps_raw <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=F),
sample_data(metaData),
tax_table(taxa2))
meta(ps_raw) # This command allows to check the metadata of a PhyloSeq object
## id_sample treatment DNAconc inoculum fruit time label
## K1 K1 2_vin_rais 77.77 vincent raisins T2 s10
## K10 K10 2_vin_rais 19.15 vincent raisins T2 s26
## K11 K11 1_vin_figs 60.69 vincent figs T1 s17
## K12 K12 4_phi_rais 246.36 philipp raisins T4 s20
## K13 K13 1_vin_figs 8.38 vincent figs T1 s46
## K14 K14 2_vin_rais 72.45 vincent raisins T2 s22
## K15 K15 3_phi_figs 72.17 philipp figs T3 s29
## K16 K16 4_phi_rais 76.66 philipp raisins T4 s28
## K17 K17 0_start 130.74 start figs START philipp2
## K18 K18 2_vin_rais 159.86 vincent raisins T2 s6
## K19 K19 1_vin_figs 30.34 vincent figs T1 s37
## K2 K2 0_start 40.81 start figs START philipp1
## K20 K20 4_phi_rais 142.11 philipp raisins T4 s36
## K21 K21 2_vin_rais 87.73 vincent raisins T2 s14
## K22 K22 3_phi_figs 77.94 philipp figs T3 s41
## K23 K23 0_start 30.24 start figs START vincent1
## K24 K24 1_vin_figs 36.59 vincent figs T1 s9
## K25 K25 4_phi_rais 157.40 philipp raisins T4 s8
## K26 K26 3_phi_figs 135.71 philipp figs T3 s35
## K27 K27 3_phi_figs 69.05 philipp figs T3 s19
## K28 K28 0_start 13.75 start figs START vincent2
## K29 K29 1_vin_figs 61.98 vincent figs T1 s29
## K3 K3 4_phi_rais 57.10 philipp raisins T4 s16
## K30 K30 3_phi_figs 113.28 philipp figs T3 s31
## K31 K31 4_phi_rais 132.41 philipp raisins T4 s40
## K32 K32 4_phi_rais 40.72 philipp raisins T4 s32
## K33 K33 3_phi_figs 75.24 philipp figs T3 s3
## K34 K34 4_phi_rais 73.05 philipp raisins T4 s12
## K4 K4 1_vin_figs 46.57 vincent figs T1 s33
## K5 K5 3_phi_figs 42.10 philipp figs T3 s7
## K6 K6 2_vin_rais 35.52 vincent raisins T2 s18
## K7 K7 4_phi_rais 24.82 philipp raisins T4 s4
## K8 K8 2_vin_rais 44.89 vincent raisins T2 s38
## K9 K9 1_vin_figs 39.14 vincent figs T1 s5
## NovoID student_Number Family_Name First_name AllFactor
## K1 FKDN220086523-1A 10 Mottet Leonie ALL
## K10 FKDN220086532-1A 26 Cergneux Julien ALL
## K11 FKDN220086533-1A 17 Tharmakulasinkam Karunnya ALL
## K12 FKDN220086534-1A 20 Corset Margaux ALL
## K13 FKDN220086535-1A 46 Hafez 2 Daniel ALL
## K14 FKDN220086536-1A 22 Jann Alexandre ALL
## K15 FKDN220086537-1A 29 Brochet Silvia ALL
## K16 FKDN220086538-1A 28 Ndiaye Malick ALL
## K17 FKDN220086539-1A 44 Engel Philipp ALL
## K18 FKDN220086540-1A 6 Liakopoulos Petros ALL
## K19 FKDN220086541-1A 37 Burz Sebastian ALL
## K2 FKDN220086524-1A 45 Engel Philipp ALL
## K20 FKDN220086542-1A 36 Matthey Noemi ALL
## K21 FKDN220086543-1A 14 Peng Xiaojing ALL
## K22 FKDN220086544-1A 41 Somerville Vincent ALL
## K23 FKDN220086545-1A 43 Somerville Vincent ALL
## K24 FKDN220086546-1A 9 Hafez Daniel ALL
## K25 FKDN220086547-1A 8 Gwyther Philip ALL
## K26 FKDN220086548-1A 35 Bonilla-Rosso German ALL
## K27 FKDN220086549-1A 19 Estelle Vivien ALL
## K28 FKDN220086550-1A 42 Somerville Vincent ALL
## K29 FKDN220086551-1A 29 Engel Philipp ALL
## K3 FKDN220086525-1A 16 Buvelot Mathilde ALL
## K30 FKDN220086552-1A 31 de Bakker Vincent ALL
## K31 FKDN220086553-1A 40 Liberti joanito ALL
## K32 FKDN220086554-1A 32 Garrido-Sanz Daniel ALL
## K33 FKDN220086555-1A 3 Moix Samuel ALL
## K34 FKDN220086556-1A 12 Dentand Alexis ALL
## K4 FKDN220086526-1A 33 Sarton-Loheac Garance ALL
## K5 FKDN220086527-1A 7 Garrido Marques Antonio ALL
## K6 FKDN220086528-1A 18 de Coning Elindi ALL
## K7 FKDN220086529-1A 4 Schmidt Camille ALL
## K8 FKDN220086530-1A 38 Gibson Paddy ALL
## K9 FKDN220086531-1A 5 Ben Amara Cyril ALL
Which command allows you to check the sample metadata from the
PhyloSeq object?
> This command allows to check the metadata: meta(ps_raw).
What is added to the phyloseq object with this command line
merge_phyloseq(ps_raw, dna)?
> The merge_phyloseq() function is used to add the ASV sequences to
the existing ps_raw phyloseq object. The dna object, which contains the
ASV sequences associated with their respective taxonomic units, is
merged into the ps_raw object. This addition allows the phyloseq object
to have both taxonomic information and the corresponding ASV
sequences.
Explain which data are we merging with this command line table =
merge(tax_table(ps_raw), t(otu_table(ps_raw)),
by=“row.names”)
> The merge() function is used to merge two tables, the taxonomic
table and the OTU (Operational Taxonomic Unit) table. The
tax_table(ps_raw) function retrieves the taxonomic table from the ps_raw
phyloseq object, while otu_table(ps_raw) retrieves the OTU table. The
merging is done based on the row names, which typically represent the
taxonomic identifiers or names of each taxonomic unit. The resulting
merged table, stored in the table variable, combines the taxonomic
information and abundance data for each taxon.
What does the phyloseq object contains in terms of data
?
> The phyloseq object contains the OTU Table (Operational Taxonomic
Unit) or ASV in each sample, taxonomic table: (taxonomic information for
each OTU or ASV), sample Data (sample-specific metadata) and refseq
object contains the reference sequences associated with the taxonomic
units.
What do you think the next steps will be? Is is useful to save
the ASVs sequences?
> Yes it is useful to save the ASVs sequences for potential next
steps, such as statistical analysis, visualization, or downstream
analysis using the phyloseq object (ps_raw).
Have a look at the Taxtable_dada2.csv and ASV_sequences.csv
files. What information can you find in them ?
> ASV_sequences.csv contains the actual nucleotide sequences of the
ASVs. Taxtable_dada2.csv contains the taxonomic information associated
with each ASV.
List each step where filtering happened in this
pipeline.
> We filtered with fastQC and trimmomatic. We trimmed again based on
several parameters such as maxN, maxEE, truncQ etc. We did trimmomatic,
cutadapt, we filtered, denoised the forward and reverse, merged and
removed the nonchim. We lost the most reads on the filtering step (see
plot below).
# Loading the table 'read_counts_overview_FWD.txt' and setting the column names
preDADA2 <- read.table("/mnt/raidarray/home/SHARED/2023_SAGE2/read_counts_overview_FWD.txt", sep = "\t")
colnames(preDADA2) <- c("Sample_name", "Before", "After_trimmomatic", "After_cutadapt")
# ordering the table by the sample names
preDADA2 <- preDADA2[order(preDADA2$Sample_name),]
getN <- function(x) sum(getUniques(x))
reads_counts <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(reads_counts) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(reads_counts) <- sample.names
# Check dimensions
dim(preDADA2) ; dim(reads_counts)
## [1] 34 4
## [1] 34 6
# Check samples order
preDADA2[,1] ; rownames(reads_counts)
## [1] "K1_1.fq.gz " "K10_1.fq.gz " "K11_1.fq.gz " "K12_1.fq.gz " "K13_1.fq.gz "
## [6] "K14_1.fq.gz " "K15_1.fq.gz " "K16_1.fq.gz " "K17_1.fq.gz " "K18_1.fq.gz "
## [11] "K19_1.fq.gz " "K2_1.fq.gz " "K20_1.fq.gz " "K21_1.fq.gz " "K22_1.fq.gz "
## [16] "K23_1.fq.gz " "K24_1.fq.gz " "K25_1.fq.gz " "K26_1.fq.gz " "K27_1.fq.gz "
## [21] "K28_1.fq.gz " "K29_1.fq.gz " "K3_1.fq.gz " "K30_1.fq.gz " "K31_1.fq.gz "
## [26] "K32_1.fq.gz " "K33_1.fq.gz " "K34_1.fq.gz " "K4_1.fq.gz " "K5_1.fq.gz "
## [31] "K6_1.fq.gz " "K7_1.fq.gz " "K8_1.fq.gz " "K9_1.fq.gz "
## [1] "K1" "K10" "K11" "K12" "K13" "K14" "K15" "K16" "K17" "K18" "K19" "K2"
## [13] "K20" "K21" "K22" "K23" "K24" "K25" "K26" "K27" "K28" "K29" "K3" "K30"
## [25] "K31" "K32" "K33" "K34" "K4" "K5" "K6" "K7" "K8" "K9"
# Join the dataframes
ReadsTrack <- cbind(sample.names, preDADA2[,2:4], reads_counts[,2:6])
# Compute the fraction of kept reads
ReadsTrack$fracKept <- ReadsTrack$nonchim / ReadsTrack$Before
# Vizualize the fraction of kept reads for each samples
ggplot(ReadsTrack, aes(x = sample.names, y = fracKept))+
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
# Save Boxplot and read track table
boxplot(ReadsTrack[, -c(1, 10)], ylim = c(0, 120000), xlab = "Steps of the analysis", ylab = "Number of reads kept", cex = 0.5)
Figure 3. Number of reads kept after each step of the analysis. The steps are input, filtered, denoisedF, denoisedR, merged and nonchim.
# Finally save the reads count table
write.csv(ReadsTrack, paste0(PathToWD, "/Reads_tracking.csv"), row.names = FALSE)
From which package the function getUniques comes from ? Briefly,
what does it produce as output?
> This function comes from the package dada2. It extracts the
uniques-vector from several different data objects, including dada-class
and derep-class objects, as well as data.frame. It creates an integer
vector named by unique sequence and valued by abundance.
Why is the read count similar between the After_trimmomatic and
After_cutadapt steps?
> The read count is similar between these two steps because their
functions are quite similar, both are in charge of removing the adapters
and the low-quality reads.
div <- function(x, y){ (1 - (x/y)) * 100 }
lostReads <- div(ReadsTrack[, 3:9], ReadsTrack[, 2])
head(lostReads)
## After_trimmomatic After_cutadapt filtered denoisedF denoisedR merged
## 3 1.332314 1.332314 39.36528 39.92098 39.67797 41.03322
## 1 1.476246 1.476246 37.67920 38.41318 38.12078 39.89653
## 2 1.535759 1.535759 34.02447 34.61180 34.27644 35.55845
## 4 1.038442 1.038442 38.05583 38.56459 38.35372 39.51684
## 5 2.311714 2.311714 48.83288 49.00913 48.96800 49.54177
## 6 2.643139 2.643139 52.35951 52.54531 52.44564 53.10766
## nonchim
## 3 43.05633
## 1 41.71017
## 2 37.63186
## 4 41.02556
## 5 51.19257
## 6 54.80330
averageLost <- mean(lostReads$nonchim)
lostperSpecies <- lostReads$nonchim
names(lostperSpecies) <- ReadsTrack$sample.names
averageLost
## [1] 44.40997
lostperSpecies
## K1 K10 K11 K12 K13 K14 K15 K16
## 43.05633 41.71017 37.63186 41.02556 51.19257 54.80330 57.12436 60.31704
## K17 K18 K19 K2 K20 K21 K22 K23
## 61.48127 57.44805 53.27990 42.50547 58.77055 54.34547 58.34894 48.37818
## K24 K25 K26 K27 K28 K29 K3 K30
## 49.86821 31.44042 33.30494 33.13039 26.19974 37.33019 45.27746 33.62564
## K31 K32 K33 K34 K4 K5 K6 K7
## 32.88389 33.91757 33.15235 35.00574 39.90740 50.15377 49.46577 42.87434
## K8 K9
## 43.79889 37.18310
min(ReadsTrack$nonchim)
## [1] 34279
What is the average lost reads ? At which step did you loose the
most reads ?
> On average, we lost 44% of our reads. We lost the most reads during
our trimming of the data (function filterAndTrim()).
Did a sample lose more reads than all the other ?
> No, overall, the loss is well distributed across our samples. The
maximum lost is sample K17 with more than 61% of loss.
Do you think enough reads are remaining ?
> Yes, as we started with a huge number of reads we still have a
significant amount. For example, the minimum number of reads for a
sample is still 34280 for the K18.
Why for targeted metagenomics (or amplicon sequencing) the 16S
rRNA gene is used?
> The 16S rRNA gene is commonly used in targeted metagenomics or
amplicon sequencing due to its ubiquity, conserved regions,
universality, and extensive database availability. It serves as a
phylogenetic marker, allowing to analyze the diversity and relationships
of microbial communities. Universal primers designed for this gene
enable the simultaneous amplification of multiple taxa. Its evolutionary
conservation facilitates comparative studies across different samples
and the inference of evolutionary relationships.
Examine the file. How many 16S rRNA genes have been identified in
our collection of assembled genomes?
> By looking at the fasta file, we see that we have 82 sequences. As
it is forward and reverse, we have 41 16S rRNA genes that have been
identified in our collection of assembled genomes.
What is a query? How many queries are in the blast
output?
> The queries are each sequence in our fasta file, so we have 82
queries.
What is the difference between a query and its hits?
> A query is the sequence we give and the hits are the results, so
the potential matches.
# Load blast results
blast_res = read.table("/mnt/raidarray/home/etu19/Blast_16SrRNAs_vs_ASVs_results.txt",
header = F,
sep = "\t",
comment.char = "#")
head(blast_res)
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV4 100.000 253 0 0 549 801
## 2 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV17 99.605 253 1 0 549 801
## 3 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV1 98.819 254 1 2 549 801
## 4 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV9 98.425 254 2 2 549 801
## 5 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV1333 98.024 253 5 0 549 801
## 6 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV548 97.638 254 4 2 549 801
## V9 V10 V11 V12 V13
## 1 1 253 1.88e-132 468 16
## 2 1 253 8.75e-131 462 16
## 3 1 253 1.89e-127 451 16
## 4 1 253 8.82e-126 446 16
## 5 1 253 4.10e-124 440 16
## 6 1 253 1.91e-122 435 16
Do you know what comment.char = “#” does?
> comment.char specifies which character should be used as a comment
character. Every line which starts with this character will be ignored
during the import.
# Extract the column names
blast_cols = read.csv2("/mnt/raidarray/home/etu19/Blast_16SrRNAs_vs_ASVs_results.txt",
header = F,
sep = ",",
skip = 3)[1,]
# Rename fields to avoid weird characters and spaces
blast_cols = gsub("# Fields: ", "", blast_cols)
blast_cols = gsub("^ ", "", blast_cols)
blast_cols = gsub(" ", ".", blast_cols)
blast_cols = gsub("%", "perc", blast_cols)
# Assign names to blast_res column names
colnames(blast_res) = blast_cols
head(blast_res)
## query.acc.ver subject.acc.ver perc.identity
## 1 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV4 100.000
## 2 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV17 99.605
## 3 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV1 98.819
## 4 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV9 98.425
## 5 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV1333 98.024
## 6 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV548 97.638
## alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1 253 0 0 549 801 1 253
## 2 253 1 0 549 801 1 253
## 3 254 1 2 549 801 1 253
## 4 254 2 2 549 801 1 253
## 5 253 5 0 549 801 1 253
## 6 254 4 2 549 801 1 253
## evalue bit.score perc.query.coverage.per.subject
## 1 1.88e-132 468 16
## 2 8.75e-131 462 16
## 3 1.89e-127 451 16
## 4 8.82e-126 446 16
## 5 4.10e-124 440 16
## 6 1.91e-122 435 16
Do you know what the last part of the read.csv2 command
skip = 3)[1,] is doing ?
> The skip argument specifies the number of lines of the data file to
skip before beginning to read data. In this case, it skips the first 3
lines. The [1,] says that we keep only the header.
What is the gsub() function doing ?
> gsub() searches for a specific substring in a string, and replaces
it with another substring (specified in the arguments). In our case, it
formats the headers so that they are more readable.
# Split query names into columns
blast_res$query.acc.ver = gsub("16S_rRNA::", "", blast_res$query.acc.ver)
blast_res = blast_res %>% separate(query.acc.ver,
c("Genome", "contig", "coords", "strand"),
sep = "_contig_|:|\\(|\\)")
head(blast_res)
## Genome contig coords strand subject.acc.ver perc.identity
## 1 ESL0963 1 403998-405554 + ASV4 100.000
## 2 ESL0963 1 403998-405554 + ASV17 99.605
## 3 ESL0963 1 403998-405554 + ASV1 98.819
## 4 ESL0963 1 403998-405554 + ASV9 98.425
## 5 ESL0963 1 403998-405554 + ASV1333 98.024
## 6 ESL0963 1 403998-405554 + ASV548 97.638
## alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1 253 0 0 549 801 1 253
## 2 253 1 0 549 801 1 253
## 3 254 1 2 549 801 1 253
## 4 254 2 2 549 801 1 253
## 5 253 5 0 549 801 1 253
## 6 254 4 2 549 801 1 253
## evalue bit.score perc.query.coverage.per.subject
## 1 1.88e-132 468 16
## 2 8.75e-131 462 16
## 3 1.89e-127 451 16
## 4 8.82e-126 446 16
## 5 4.10e-124 440 16
## 6 1.91e-122 435 16
Do you understand what the _contig_|:|\\(|\\) within
the separate() function is doing ?
> This argument specifies that the value should be separated at
“\contig\”, “:”, “(” and “)”. These characters will also be
removed. This way, the value will be split in 4 columns: contig,
coordinates, strand and the ASV id.
# Filter BLAST
filt_threshold = 100
blast_filt = filter(blast_res, perc.identity >= filt_threshold)
head(blast_filt)
## Genome contig coords strand subject.acc.ver perc.identity
## 1 ESL0963 1 403998-405554 + ASV4 100
## 2 ESL0963 1 1720439-1721994 - ASV4 100
## 3 ESL0961 1 268380-269947 + ASV5 100
## 4 ESL0965 1 267413-268980 + ASV5 100
## 5 ESL0967 1 267163-268730 + ASV5 100
## 6 ESL0969 1 269573-271140 + ASV5 100
## alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1 253 0 0 549 801 1 253
## 2 253 0 0 548 800 1 253
## 3 253 0 0 556 808 1 253
## 4 253 0 0 556 808 1 253
## 5 253 0 0 556 808 1 253
## 6 253 0 0 556 808 1 253
## evalue bit.score perc.query.coverage.per.subject
## 1 1.88e-132 468 16
## 2 1.88e-132 468 16
## 3 1.89e-132 468 16
## 4 1.89e-132 468 16
## 5 1.89e-132 468 16
## 6 1.89e-132 468 16
# Count the different ASVs
length(unique(blast_filt$subject.acc.ver))
## [1] 10
Can you count how many different ASVs matched at a 100% sequence
identity our assembled genomes ?
> There are 10 different ASVs that matched at a 100% sequence
identity with our assembled genomes.
# Create a dataframe with 3 columns
genome_tax = data.frame("Genome"=c("ESL0961", "ESL0965", "ESL0967", "ESL0969",
"ESL0962", "ESL0970", "ESL0968_iso_01",
"ESL0976", "ESL0964", "ESL0966", "ESL0968_iso_02",
"ESL0971", "ESL0972", "ESL0974", "ESL0963"),
"Copy.no.16S" = c("5","5","5","5","5","5","5","5",
"6", "6", "6", "6", "6", "6", "6"),
"Taxa"=c("Lacticaseibacillus paracasei",
"Lacticaseibacillus paracasei",
"Lacticaseibacillus paracasei",
"Lacticaseibacillus paracasei",
"Lentilactobacillus parabuchneri",
"Lentilactobacillus parabuchneri",
"Lentilactobacillus parabuchneri",
"Lentilactobacillus hilgardii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus ghanensis"))
head(genome_tax)
## Genome Copy.no.16S Taxa
## 1 ESL0961 5 Lacticaseibacillus paracasei
## 2 ESL0965 5 Lacticaseibacillus paracasei
## 3 ESL0967 5 Lacticaseibacillus paracasei
## 4 ESL0969 5 Lacticaseibacillus paracasei
## 5 ESL0962 5 Lentilactobacillus parabuchneri
## 6 ESL0970 5 Lentilactobacillus parabuchneri
# Add this information to the blast results
blast_filt = left_join(blast_filt, genome_tax, by="Genome")
head(blast_filt)
## Genome contig coords strand subject.acc.ver perc.identity
## 1 ESL0963 1 403998-405554 + ASV4 100
## 2 ESL0963 1 1720439-1721994 - ASV4 100
## 3 ESL0961 1 268380-269947 + ASV5 100
## 4 ESL0965 1 267413-268980 + ASV5 100
## 5 ESL0967 1 267163-268730 + ASV5 100
## 6 ESL0969 1 269573-271140 + ASV5 100
## alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1 253 0 0 549 801 1 253
## 2 253 0 0 548 800 1 253
## 3 253 0 0 556 808 1 253
## 4 253 0 0 556 808 1 253
## 5 253 0 0 556 808 1 253
## 6 253 0 0 556 808 1 253
## evalue bit.score perc.query.coverage.per.subject Copy.no.16S
## 1 1.88e-132 468 16 6
## 2 1.88e-132 468 16 6
## 3 1.89e-132 468 16 5
## 4 1.89e-132 468 16 5
## 5 1.89e-132 468 16 5
## 6 1.89e-132 468 16 5
## Taxa
## 1 Liquorilactobacillus ghanensis
## 2 Liquorilactobacillus ghanensis
## 3 Lacticaseibacillus paracasei
## 4 Lacticaseibacillus paracasei
## 5 Lacticaseibacillus paracasei
## 6 Lacticaseibacillus paracasei
Do you know what the left_join() function is doing?
How the two data.frames (blast_res and genome_tax) are being combined
?
> The left_join() function is taking all the common
genomes between the two data frames, getting the corresponding lines
from the right dataframe (all columns) and pasting these columns to the
left dataframe.
# ASVs that matched at 100%
plot_match = ggplot(blast_filt, aes(y=paste0(Taxa," ",Genome))) +
geom_bar(aes(fill=subject.acc.ver)) + #ASVs will be represented as bars
geom_point(aes(x=-1.5, color=Taxa), size=2.5) + #points will represent the genomes taxa
geom_text(aes(x=-3.2), label=blast_filt$Copy.no.16S)+ #number of 16S rRNA copies per genome
ggtitle("Isolate-ASV matching")+
#Color scale for the Species based on assembled genomes
scale_color_manual(values=c("coral",
"bisque4",
"lightgoldenrod2",
"aquamarine3",
"deepskyblue3"))+
#Color scale for the ASVs matching
scale_fill_d3("category20c")+
#ticks, breaks and labels of the x scale
scale_x_continuous(position = "top",
breaks = c(-3.2,0,2,4,6,8,10,15,20),
labels = c("16S rRNA \ncopies",0,2,4,6,8,10,15,20)
) +
#Title of X axis
xlab(paste0(filt_threshold, "% ASV match count")) +
#titles for the legend.
labs(fill="ASV", color="Species (whole-genome)") +
#theme and theme options
theme_classic() +
theme(axis.title.y = element_blank(),
axis.line = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x.top = element_text(),
legend.key.size = unit(0.4, 'cm'),
legend.key.height = unit(0.4, 'cm'),
legend.key.width = unit(0.4, 'cm'))
plot_match
# Specific ASV matching each of the 16S rRNa copies
ggplot(blast_filt, aes(y=paste0(Taxa," ",Genome," ", coords))) +
geom_bar(aes(fill=subject.acc.ver)) +
geom_point(aes(x=-0.2, color=Genome), size=2, shape=15) +
scale_color_d3("category20")+
scale_fill_d3("category20c") +
scale_x_continuous(position = "top")+
xlab(paste0(filt_threshold, "% ASV match count")) +
labs(fill="ASV", color="Genome") +
theme_classic() +
theme(axis.title.y = element_blank(),
axis.line = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x.top = element_text(),
axis.text.y = element_text(size = 5),
legend.key.size = unit(0.3, 'cm'),
legend.key.height = unit(0.3, 'cm'),
legend.key.width = unit(0.3, 'cm'))
Figure 4. For each genome, the number of ASV matching at 100% and which species they are representing.
Why most genomes have multiple ASVs matching ?
> Because they have many copies of 16S rRNA.
How can you explain that L. paracasei genomes match 8 ASV while
these genomes only contain 5 copies of the 16S rRNA gene? and others
like L. nagelii that have 6 copies of the 16S rRNA match with less
?
> This is because of the BLAST results, where a contig can match
different ASVs with 100%. This is due to several genes containing the
same sequnence. It can also be that one ASV is 1 nucelotide longer.
Furthermore, if we lower the threshold (99% match for example), the
number of matches increases. We notice that ASVs are not the whole gene
but just a part of it. therefore less precise.
ps <- readRDS("/mnt/raidarray/home/etu19/DADA2/04_Phyloseq_object/PhyloSeq_Object.rds")
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5066 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 12 sample variables ]
## tax_table() Taxonomy Table: [ 5066 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 5066 reference sequences ]
ncol(ps@otu_table)
## [1] 5066
From the above output (executing directly the name of the
phyloseq object), how many ASVs are there? How many samples ?
> There are 5066 ASVs in our PhyloSeq object, and there are 34
samples.
ps@otu_table[1:5,1:5]
## OTU Table: [5 taxa and 5 samples]
## taxa are columns
## ASV1 ASV2 ASV3 ASV4 ASV5
## K1 24232 19093 3937 3337 2805
## K10 12045 8152 2716 3604 1970
## K11 11960 34964 4263 2762 2951
## K12 11729 11190 1675 10797 1467
## K13 11792 1136 1064 858 737
The otu_table contains raw sequence numbers or relative abundance
?
> The otu_table contains raw sequence numbers.
# Convert into relative abundance
ps_ra = transform_sample_counts(ps, function(x) x/sum(x))
ps_ra@otu_table[1:5,1:5]
## OTU Table: [5 taxa and 5 samples]
## taxa are columns
## ASV1 ASV2 ASV3 ASV4 ASV5
## K1 0.3615745 0.28489361 0.05874541 0.04979259 0.04185443
## K10 0.2446381 0.16556991 0.05516289 0.07319847 0.04001137
## K11 0.1701329 0.49736835 0.06064184 0.03928988 0.04197843
## K12 0.1664136 0.15876619 0.02376527 0.15319022 0.02081412
## K13 0.2365496 0.02278837 0.02134403 0.01721163 0.01478435
# Look at the top 250 ASVs
toptax = names(sort(taxa_sums(ps), decreasing=TRUE))[1:250]
ps.toptax = transform_sample_counts(ps, function(x) {x/sum(x)})
ps.toptax <- prune_taxa(toptax, ps.toptax)
plot_bar(ps.toptax, x="id_sample", fill="Phylum") +
ylab("Relative abundance") + xlab("") +
ggtitle("Phylum level", "Top 250 ASVs") +
theme(legend.key.size = unit(0.3, 'cm'), #change legend key size
legend.key.height = unit(0.3, 'cm'), #change legend key height
legend.key.width = unit(0.3, 'cm'), #change legend key width
legend.title = element_blank(), #remove legend title font size
legend.text = element_text(size=7), #change legend text font size
legend.position = "top", #change legend position
axis.text.x = element_text(vjust = 0.5))
Figure 5. Top 250 ASVs, showing which phylum they represent and in which abundance, for each sample.
Which phyla dominate the water kefir samples? Do you observe
dramatic changes across samples ?
> The Firmicutes dominate the water kefir samples. There are no
dramatic changes, although Proteobacteria and Actinobacteriota can get
up to 50 and 40% respectively.
# Retrieve a list of the exact ASVs
matching_asvs = unique(blast_filt$subject.acc.ver)
matching_asvs
## [1] "ASV4" "ASV5" "ASV3" "ASV8" "ASV17" "ASV1" "ASV9"
## [8] "ASV2" "ASV4438" "ASV16"
# Filter
ps_raASVs = prune_taxa(colnames(otu_table(ps_ra)) %in% matching_asvs, ps_ra)
ps_raASVs
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 12 sample variables ]
## tax_table() Taxonomy Table: [ 10 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 10 reference sequences ]
What is the
colnames(otu_table(ps_ra)) %in% matching_asvs doing
?
> This code is checking which ASVs are present in matching_asvs, and
keeping them for the prune_taxa() function.
# Add info to phyloseq object
tax = as.data.frame(ps_raASVs@tax_table)
tax$ASV = rownames(tax)
#Add tax again to the phyloseq object
tax_table(ps_raASVs) = as.matrix(tax)
# Plot at the species level
sp_plot = plot_bar(ps_raASVs, x="id_sample", fill="Species") +
ylab("Relative abundance") + xlab("") +
ggtitle("Species level", paste0(filt_threshold, "% ASV matching")) +
theme(legend.key.size = unit(0.3, 'cm'), #change legend key size
legend.key.height = unit(0.3, 'cm'), #change legend key height
legend.key.width = unit(0.3, 'cm'), #change legend key width
legend.title = element_blank(), #remove legend title font size
legend.text = element_text(size=7), #change legend text font size
legend.position = "top", #change legend position
axis.text.x = element_text(vjust = 0.5)) #vertically center x labels
# Plot at the genus level
gen_plot = plot_bar(ps_raASVs, x="id_sample", fill="Genus") +
ylab("Relative abundance") + xlab("") +
ggtitle("Genus level", paste0(filt_threshold, "% ASV matching")) +
theme(legend.key.size = unit(0.3, 'cm'),
legend.key.height = unit(0.3, 'cm'),
legend.key.width = unit(0.3, 'cm'),
legend.title = element_blank(),
legend.text = element_text(size=7),legend.position = "top")
ggarrange(sp_plot, #plot 1
gen_plot + theme(axis.title.y = element_blank()), #plot2
ncol = 2,
align = "h",
labels = "AUTO") #puts the A, B... of the panels.
Do the genomes that we obtained during SAGE I represent a big
part of the community ?
> Yes they represent a big part, because the relative abundance is
close to 100%.
How many of the matching ASVs are classified at the Species
level? Do you observe the 5 different species identified in the
collection of assembled genomes ?
> There are 2 matching ASVs: L. paracasei and L.
ghanensis. No, we don’t observe the 5 different species identified
in the collection of assembled genome.
Check also the classification of the matching ASVs at the Genus
level. What do you see? Do you see the 3 different genera identified in
the collection of the assembled genomes ?
> We only have Lactobacillus and Reyranella, but
the latter is almost not present.
How can you explain the differences observed in the taxonomy of
ASVs at the genus and species levels compared with the taxonomy of the
assembled genomes ?
> The difference in taxonomy between ASVs and assembled genomes is
due to the smaller resolution of ASVs, compared to genomes. So, there
can be sequence errors and artifacts. Of course, ASVs do not provide a
complete genome representation, so variation in other genomic regions
can be missed.
# Coloring by ASV
asv_plot <- plot_bar(ps_raASVs, x="id_sample", fill="ASV") +
ylab("Relative abundance") +
ggtitle("ASV level", paste0(filt_threshold, "% ASV matching")) +
theme(legend.key.size = unit(0.3, 'cm'),
legend.key.height = unit(0.3, 'cm'),
legend.key.width = unit(0.3, 'cm'),
legend.title = element_blank(),
legend.text = element_text(size=7),
legend.position = "top")
asv_plot
Can you generate a plot with ggarrange() containing the three
plots? at the ASV level, Species level and Genus level ?
> See code and plot below.
ggarrange(sp_plot,
gen_plot + theme(axis.title.y = element_blank()),
asv_plot + theme(axis.title.y = element_blank()),
ncol = 3,
align = "h",
labels = "AUTO")
Figure 6. Top 250 ASVs matching at 100%, at the species, genus and ASV level.
# Filtering BLAST results
ASV_reclass = blast_filt %>% select(Taxa, "ASV"=subject.acc.ver) %>% unique()
ASV_reclass
## Taxa ASV
## 1 Liquorilactobacillus ghanensis ASV4
## 3 Lacticaseibacillus paracasei ASV5
## 8 Lacticaseibacillus paracasei ASV3
## 9 Lacticaseibacillus paracasei ASV8
## 17 Liquorilactobacillus ghanensis ASV17
## 35 Liquorilactobacillus nagelii ASV1
## 62 Liquorilactobacillus nagelii ASV9
## 70 Lentilactobacillus hilgardii ASV2
## 75 Liquorilactobacillus nagelii ASV4438
## 76 Lentilactobacillus parabuchneri ASV16
# Reassign the taxonomy
tax2 = left_join(tax, ASV_reclass, by="ASV")
rownames(tax2) = tax2$ASV
# Add tax again to the phyloseq object
tax_table(ps_raASVs) = as.matrix(tax2)
ps_raASVs
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 12 sample variables ]
## tax_table() Taxonomy Table: [ 10 taxa by 9 taxonomic ranks ]
## refseq() DNAStringSet: [ 10 reference sequences ]
# Plot
reclass_ASVs = plot_bar(ps_raASVs, x="id_sample", fill="Taxa") +
ylab("Relative abundance") +
ggtitle("Reclassified ASVs", paste0(filt_threshold, "% ASV matching")) +
scale_fill_manual(values=c("coral",
"bisque4",
"lightgoldenrod2",
"aquamarine3",
"deepskyblue3"))
reclass_ASVs
What are the differences you observe with this plot with
reassignation of ASVs taxonomy based on whole-genome information
compared to the previous ones based on the taxonomic classification of
ASVs agains the SILVA database ?
> We found 5 different species based on their whole genome, whereas
before we were only able to find 1 or 2.
What is causing these differences ?
> We can find information outside of the 16S which allows for better
classification and taxonomy.
reclass_type <- plot_bar(ps_raASVs, x="label", fill="Taxa") +
ylab("Relative abundance") +
ggtitle("Reclassified ASVs") +
scale_fill_manual(values=c("coral",
"bisque4",
"lightgoldenrod2",
"aquamarine3",
"deepskyblue3"))+
facet_wrap(~inoculum+fruit, scales = "free_x", nrow = 1)
reclass_type
Figure 7. Reclassified ASVs, at the species level, grouped by the origin and food of the species.
Can you create a composite figure showing the reclassification of
ASVs at these two different thresholds (100% and 99%) ?
> See code and plots below.
# Filter BLAST
filt_threshold = 99
blast_filt_99 = filter(blast_res, perc.identity >= filt_threshold)
# Count the different ASVs
length(unique(blast_filt_99$subject.acc.ver))
## [1] 13
# Create a dataframe with 3 columns
genome_tax_99 = data.frame("Genome"=c("ESL0961", "ESL0965", "ESL0967", "ESL0969",
"ESL0962", "ESL0970", "ESL0968_iso_01",
"ESL0976", "ESL0964", "ESL0966", "ESL0968_iso_02",
"ESL0971", "ESL0972", "ESL0974", "ESL0963"),
"Copy.no.16S" = c("5","5","5","5","5","5","5","5",
"6", "6", "6", "6", "6", "6", "6"),
"Taxa"=c("Lacticaseibacillus paracasei",
"Lacticaseibacillus paracasei",
"Lacticaseibacillus paracasei",
"Lacticaseibacillus paracasei",
"Lentilactobacillus parabuchneri",
"Lentilactobacillus parabuchneri",
"Lentilactobacillus parabuchneri",
"Lentilactobacillus hilgardii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus nagelii",
"Liquorilactobacillus ghanensis"))
# Add this information to the blast results
blast_filt_99 = left_join(blast_filt_99, genome_tax_99, by="Genome")
head(blast_filt_99)
## Genome contig coords strand subject.acc.ver perc.identity
## 1 ESL0963 1 403998-405554 + ASV4 100.000
## 2 ESL0963 1 403998-405554 + ASV17 99.605
## 3 ESL0963 1 1720439-1721994 - ASV4 100.000
## 4 ESL0963 1 1720439-1721994 - ASV17 99.605
## 5 ESL0961 1 268380-269947 + ASV5 100.000
## 6 ESL0961 1 268380-269947 + ASV3 99.605
## alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1 253 0 0 549 801 1 253
## 2 253 1 0 549 801 1 253
## 3 253 0 0 548 800 1 253
## 4 253 1 0 548 800 1 253
## 5 253 0 0 556 808 1 253
## 6 253 1 0 556 808 1 253
## evalue bit.score perc.query.coverage.per.subject Copy.no.16S
## 1 1.88e-132 468 16 6
## 2 8.75e-131 462 16 6
## 3 1.88e-132 468 16 6
## 4 8.75e-131 462 16 6
## 5 1.89e-132 468 16 5
## 6 8.82e-131 462 16 5
## Taxa
## 1 Liquorilactobacillus ghanensis
## 2 Liquorilactobacillus ghanensis
## 3 Liquorilactobacillus ghanensis
## 4 Liquorilactobacillus ghanensis
## 5 Lacticaseibacillus paracasei
## 6 Lacticaseibacillus paracasei
# Retrieve a list of the exact ASVs
matching_asvs_99 = unique(blast_filt_99$subject.acc.ver)
matching_asvs_99
## [1] "ASV4" "ASV17" "ASV5" "ASV3" "ASV8" "ASV1" "ASV9"
## [8] "ASV2" "ASV366" "ASV25" "ASV3323" "ASV4438" "ASV16"
# Filter
ps_raASVs_99 = prune_taxa(colnames(otu_table(ps_ra)) %in% matching_asvs_99, ps_ra)
ps_raASVs_99
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 13 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 12 sample variables ]
## tax_table() Taxonomy Table: [ 13 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 13 reference sequences ]
# Add info to phyloseq object
tax_99 = as.data.frame(ps_raASVs_99@tax_table)
tax_99$ASV = rownames(tax_99)
#Add tax again to the phyloseq object
tax_table(ps_raASVs_99) = as.matrix(tax_99)
# Filtering BLAST results
ASV_reclass_99 = blast_filt_99 %>% select(Taxa, "ASV"=subject.acc.ver) %>% unique()
# Reassign the taxonomy
tax2_99 = left_join(tax_99, ASV_reclass_99, by="ASV")
rownames(tax2_99) = tax2_99$ASV
# Add tax again to the phyloseq object
tax_table(ps_raASVs_99) = as.matrix(tax2_99)
# Plot
reclass_ASVs_99 = plot_bar(ps_raASVs_99, x="id_sample", fill="Taxa") +
ylab("Relative abundance") +
ggtitle("Reclassified ASVs", paste0(filt_threshold, "% ASV matching")) +
scale_fill_manual(values=c("coral",
"bisque4",
"lightgoldenrod2",
"aquamarine3",
"deepskyblue3"))
ggarrange(reclass_ASVs, #plot 1
reclass_ASVs_99 + theme(axis.title.y = element_blank()), #plot2
ncol = 2,
labels = "AUTO",
common.legend = TRUE,
legend = "right")
reclass_type_99 <- plot_bar(ps_raASVs_99, x="label", fill="Taxa") +
ylab("Relative abundance") +
ggtitle("Reclassified ASVs") +
scale_fill_manual(values=c("coral",
"bisque4",
"lightgoldenrod2",
"aquamarine3",
"deepskyblue3"))+
facet_wrap(~inoculum+fruit, scales = "free_x", nrow = 1)
ggarrange(reclass_type, #plot 1
reclass_type_99 + theme(axis.title.y = element_blank()), #plot2
ncol = 2,
align = "h",
labels = "AUTO",
common.legend = TRUE,
legend = "right")
Figure 8. Comparison of reclassified ASVs, at a 100% (A) and 99% (B) matching. On the bottom figure, it is grouped by treatment.
rank_names(ps)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
get_taxa_unique(ps,"Kingdom")
## [1] "Bacteria" "Archaea"
ps <- subset_taxa(ps, Kingdom != "Archaea" | is.na(Kingdom))
ps <- subset_taxa(ps, Order != "Chloroplast" | is.na(Order))
ps <- subset_taxa(ps, Family != "Mitochrondria" | is.na(Family))
pl.rare.all <- ggrare(ps, step = 1000, label = "Sample", se = FALSE) + xlab("Sequencing Depth") + ylab("ASV Richness")
Figure 9. Rarefaction plot of all the samples.
How well-sampled are our samples? Would you recommend more
sequencing? Or would you recommend less sequencing depth next
time?
> There are well-sampled because they all reach a plateau. Sequencing
depth is good, as it allow for the ASVs with a low species richness to
have a sufficient sequencing depth to be able to compare them with ASVs
that have a high species richness.
How heterogeneous is the sequencing depth?
> It is not too heterogeneous, with the lowest sequencing depth being
around 63’000 and the highest around 100’000.
How fast are the curves flattening? What does this
mean?
> Some of then are flattening really fast, like the K15 and K24
(flatten at 10’000). On the opposite side, samples with a higher species
richness flatten later, like K34 (flattens at 75’000). It means that
there are less diversity in the ones that flatten quickly. So, adding
ASVs doesn’t increase the diversity.
Can we use these samples like this? How could it possibly affect
our analyses?
> Yes, we can use the samples like this, because the smaller samples
(K6 and K15) have a sequencing depth at which all other samples have
reached their plateau so there is no need to rarefy. Otherwise we would
lose information.
# Calculate prevalence across all samples
ps.prev <- estimate_prevalence(ps,group="AllFactor",rarefy = FALSE)
ps.prev$samplePrev <- ps.prev$prevalence * 34 # Convert to absolute sample numbers
ps.prev$relAbund <- ( ps.prev$abundance / sum(ps.prev$abundance) ) # Calculate relative abundance
#Plot
pl.prev.his <- ggplot(ps.prev, aes(x=samplePrev))+
geom_histogram(colour="steelblue",fill="steelblue", binwidth = 1, boundary = -0.5) +
scale_x_continuous(breaks = 1:34) +
xlab("Number of samples in which ASVs are found") +
ylab("Number of ASVs") +
stat_bin(binwidth=1, geom='text', colour="black", size = 2, aes(label=..count..), position=position_stack(vjust = 1.1))
pl.prev.his
## Compare prevalence with relative abundance
# select the "Phylum" taxonomic assignation for your ASVs
ps.tax.class <- data.frame(tax_table(ps)[,2])
# Plot
pl.prev.sct <- ggplot(ps.prev, aes(x=samplePrev, y=relAbund, color=ps.tax.class$Phylum)) +
geom_point() +
theme(legend.title=element_blank(),legend.text = element_text(size = 8),
legend.position="bottom") +
xlab("Number of samples in which ASVs are found") +
ylab("Relative abundance")
pl.prev.sct
Figure 10. Number of samples in which ASVs are found, y-axis is the number of ASVs (top) and the relative abundance (bottom).
Does the majority of the ASVs have high or low
prevalence?
> The majority of ASVs has a low prevalence. The prevalence is the
number of sample where a ASV is found. We have very few ASV that are in
lots of sample. We have 3’000 ASV that are in only one sample
(contamination, misassignation, chimera) these are not super meaningful.
They are called singleton. For waterkefir, all originated from the same
culture.
Are there any ASVs occurring in all samples? Were we expecting
it?
> There are 9 ASVs occuring in all 34 samples. As they all come from
the same culture at the beginning we do expect that.
Are the most abundant ASVs dominating a single or few samples or
are they present in all samples?
> We have 5 or 10 ASV that are in all sample. The ones that are
present in all sample we expect them to be highly abundant. As we
expected the ones that are present in only one sample have very low
abundance. And the one the are present in all sample are the most
abundant, which is nice, they all belong to the same phylum. This is
expected as it is the dominating genus.
# Calculate how many total ASVs
tmp_df <- t(otu_table(ps))
ps.asv <- data.frame(colSums(tmp_df!=0))
colnames(ps.asv) <- "ASVs"
rm(tmp_df)
# Now add the total read number
ps.asv$reads <- sample_sums(ps)
colnames(ps.asv) <- c("ASVs", "reads")
ps.asv$sampleID <- rownames(ps.asv)
# Make barplots for each sample:
pl.asv.bar <- ggplot(ps.asv, aes(x = sampleID, y = ASVs) ) +
ggtitle("Total ASVs per sample") +
ylab("ASV number") +
geom_bar(stat="identity", fill="steelblue")+
theme(axis.text.x = element_text(angle=90))
pl.red.bar <- ggplot(ps.asv, aes(x = sampleID, y = reads) ) +
ggtitle("Total reads per sample") +
ylab("Read number") +
geom_bar(stat="identity", fill="indianred")+
theme(axis.text.x = element_text(angle=90))
# Histogram of ASVs and read counts
pl.asv.his <- ggplot(ps.asv, aes(x = reads)) +
geom_histogram(color = "black", fill = "indianred", binwidth = 5000) +
ggtitle("Total reads distribution") +
xlab("Reads") +
ylab("Samples")+
theme(axis.text.x = element_text(angle=90))
pl.red.his <- ggplot(ps.asv, aes(x = ASVs)) +
geom_histogram(color = "black", fill = "steelblue", binwidth = 50) +
ggtitle("Total ASV distribution") +
xlab("ASVs") +
ylab("Samples")+
theme(axis.text.x = element_text(angle=90))
# Visualize
ggarrange(pl.red.bar, pl.asv.bar, pl.asv.his, pl.red.his, ncol=2, nrow=2)
Figure 11. Visualisation of the number of reads and ASVs per sample, and the number of samples per read and per ASVs.
summary(ps.asv)
## ASVs reads sampleID
## Min. : 52.0 Min. :34228 Length:34
## 1st Qu.:106.5 1st Qu.:46330 Class :character
## Median :202.5 Median :55012 Mode :character
## Mean :238.4 Mean :56753
## 3rd Qu.:318.8 3rd Qu.:67024
## Max. :599.0 Max. :84590
What can you learn from these plots?
> In red, we can see the total of reads per sample and the
distribution of these reads. In blue, we can see the total of ASVs per
sample and the distribution.
How variable is the total ASV richness across samples?
> The number of ASVs varies between 52 and 599, with a mean of 238.4,
so we see that it is highly variable.
summary(lm(ps.asv$ASVs ~ ps.asv$reads))
##
## Call:
## lm(formula = ps.asv$ASVs ~ ps.asv$reads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -163.72 -67.53 -21.64 61.54 218.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.659e+02 7.559e+01 -3.518 0.00133 **
## ps.asv$reads 8.886e-03 1.298e-03 6.848 9.56e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99.44 on 32 degrees of freedom
## Multiple R-squared: 0.5944, Adjusted R-squared: 0.5817
## F-statistic: 46.89 on 1 and 32 DF, p-value: 9.56e-08
pl.asv_red.scp <- ggplot(ps.asv, aes(x=reads, y=ASVs)) +
geom_point()+
geom_smooth(method=lm)
## Compare ASVs against DNA yields
# Add DNA yields from the sample metadata
tmp1 <- sample_data(ps)[,"DNAconc"]
ps.asv2 <- merge(ps.asv, tmp1, by="row.names", all=TRUE)
rownames(ps.asv2) <- ps.asv2$sampleID
ps.asv2[,"Row.names"] <- NULL
# Compare ASVs against DNA yield
summary(lm(ps.asv2$ASVs ~ ps.asv2$DNAconc))
##
## Call:
## lm(formula = ps.asv2$ASVs ~ ps.asv2$DNAconc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -186.05 -131.73 -34.20 81.03 360.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 227.7023 47.4950 4.794 3.61e-05 ***
## ps.asv2$DNAconc 0.1434 0.5269 0.272 0.787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 156 on 32 degrees of freedom
## Multiple R-squared: 0.002309, Adjusted R-squared: -0.02887
## F-statistic: 0.07405 on 1 and 32 DF, p-value: 0.7873
pl.asv_dna.scp <- ggplot(ps.asv2, aes(x=DNAconc, y=ASVs)) +
geom_point()+
geom_smooth(method=lm)
rm(tmp1, tmp2)
# Visualize
ggarrange(pl.asv_dna.scp, pl.asv_red.scp, ncol=2, nrow=1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Figure 12. Linear regressions umber of ASVs as a function of DNA extraction yields (left), and sequencing depth (right)
Is the number of ASVs biased by the DNA extraction yields? Is it
statistically significant?
> No, it is not biased because the points are randomly distributed
and not necessarily around the blue line. We can say our regression does
not find a trend in our datapoints. Moreover, the p-value is not
significant (p = 0.787).
Is the number of ASVs biased by the sequencing depth? Is it
statistically significant?
> Yes, it is biased by the sequencing depth. The points are
concentrated around the line. The p-value is significant (p = 9.56e-08).
So the number of ASVs is correlated with the sequencing depth.
## First lets count how many Phyla are in each sample
get_taxa_unique(ps,taxonomic.rank = "Phylum")
## [1] "Firmicutes" "Actinobacteriota"
## [3] "Proteobacteria" "Desulfobacterota"
## [5] "Bacteroidota" "Verrucomicrobiota"
## [7] "Cyanobacteria" "Deferribacterota"
## [9] "Marinimicrobia (SAR406 clade)" "Spirochaetota"
## [11] NA "Campilobacterota"
## [13] "SAR324 clade(Marine group B)" "Acidobacteriota"
## [15] "Chloroflexi" "Planctomycetota"
## [17] "Myxococcota" "Fibrobacterota"
## [19] "NB1-j" "Nitrospirota"
## [21] "Abditibacteriota" "Gemmatimonadota"
## [23] "Dadabacteria" "Armatimonadota"
## [25] "Bdellovibrionota" "Entotheonellaeota"
## [27] "Fusobacteriota" "Methylomirabilota"
## [29] "Nitrospinota" "Synergistota"
## [31] "RCP2-54" "PAUC34f"
## [33] "Elusimicrobiota" "WPS-2"
## [35] "LCP-89" "FCPU426"
## [37] "Latescibacterota" "Deinococcota"
## [39] "Patescibacteria" "Deferrisomatota"
## [41] "Cloacimonadota" "Acetothermia"
## [43] "Dependentiae" "Calditrichota"
## [45] "Caldatribacteriota" "AncK6"
## [47] "Margulisbacteria" "Sumerlaeota"
# Let''s look at the distribution of phyla in each sample
ps.phy <- tax_glom(ps, taxrank="Phylum")
ps.phy.df <- psmelt(ps.phy)
# Plot each treatment group next to each other
ps.phy.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Phylum")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the phylum level")
K22_df <- ps.phy.df[ps.phy.df$id_sample == "K22",]
nrow(K22_df)
## [1] 47
# Now let's do the same thing at the genus level, displaying only the 20 most abundant genera
# select top 10 most prevalent Families over the whole dataset
ps.gen <- tax_glom(ps, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)
# Plot each treatment group next to each other
ps.gen.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the Genus level")
Figure 13. Relative abundance for every sample, at the phylum and genus level.
Are there any dominant taxon?
> The dominant taxon is Firmicutes at the phylum level, and it is
Lactobacillus at the genus level.
Is this similar to what has been reported in the
literature?
> Yes, it is similar to what has been reported in the literature.
library(genefilter)
ps.prop = transform_sample_counts(ps, function(x) {x/sum(x)})
ps.filt = filter_taxa(ps.prop, filterfun(kOverA(1, 0.01)), TRUE)
# Then save an object with the same ASVs but with raw counts
toKeep <- taxa_names(ps.filt)
ps.rawfilt <- prune_taxa(toKeep, ps)
# Fix tax_table
tmptax <- data.frame(tax_table(ps.filt))
tmptax$Linnaeus <- paste(tmptax$Genus,tmptax$Species)
tax_table(ps.filt) <- as.matrix(tmptax)
tax_table(ps.rawfilt) <- as.matrix(tmptax)
# Plot again at Genus level
plot_bar(ps.filt, x="id_sample",fill="Genus")
plot_bar(ps.rawfilt, x="id_sample",fill="Genus")
# Plot at species level
plot_bar(ps.filt, x="id_sample",fill="Linnaeus" ) +
theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="right")+
guides(color=guide_legend(ncol=1, byrow=FALSE))
plot_bar(ps.rawfilt, x="id_sample",fill="Linnaeus" ) +
theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="right")+
guides(color=guide_legend(ncol=1, byrow=FALSE))
Figure 14. Abundance at the genus level, before and after filtering; abundance at the species level, before and after filtering.
Did rare ASVs that were removed from this filtering step
represent a large proportion of all ASVs in our samples?
> No, rare ASVs that were removed from this filtering step did not
represent a large proportion of all ASVs in our samples. We can see that
by comparing the plot for the filtered dataset with the first plot.
# Question 1
ps.rawfilt.q1 = subset_samples(ps.rawfilt, time!="END")
ps.q3 = subset_samples(ps, time!="START")
ps.filt.q3 = subset_samples(ps.filt, time!="START")
ps.rawfilt.q3 = subset_samples(ps.rawfilt, time!="START")
pl.alpha <- plot_richness(ps.rawfilt, color = "inoculum", x="inoculum", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)
# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## inoculum 2 26.08 13.042 2.748 0.0807 .
## reads 1 27.83 27.830 5.864 0.0219 *
## DNAconc 1 3.39 3.394 0.715 0.4046
## Residuals 29 137.63 4.746
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## inoculum 2 0.6903 0.3452 4.932 0.0143 *
## reads 1 0.0190 0.0190 0.271 0.6064
## DNAconc 1 0.1750 0.1750 2.500 0.1247
## Residuals 29 2.0293 0.0700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Simpson ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## inoculum 2 0.07078 0.03539 3.751 0.0356 *
## reads 1 0.00193 0.00193 0.204 0.6547
## DNAconc 1 0.02409 0.02409 2.553 0.1209
## Residuals 29 0.27362 0.00944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How similar are the communities in the two kefir starter
cultures?
> We can see that the two starters communities have similar richness
(p = 0.0807) in the leftmost plot (“Observed”), but Shannon and Simpson
are significantly different (p = 0.0143 and 0.0356, respectively). The
Shannon value is a measure of evenness, weighing equally richness and
abundance, and the Simpson measure is the probability of encounter, and
thus a measure of dominance. So our cultures have similar richness but
different evenness and dominance. In all three plots, the commmunity
from Philipp looks very close to the start culture, whereas Vincent’s
community is more different.
# Question 2
pl.alpha <- plot_richness(ps.rawfilt, color = "time", x="time", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)
# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## time 4 31.02 7.75 1.681 0.1834
## reads 1 33.91 33.91 7.351 0.0115 *
## DNAconc 1 5.48 5.48 1.187 0.2855
## Residuals 27 124.54 4.61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## time 4 0.7282 0.18205 2.510 0.0653 .
## reads 1 0.0152 0.01518 0.209 0.6510
## DNAconc 1 0.2118 0.21183 2.921 0.0989 .
## Residuals 27 1.9584 0.07253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Simpson ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## time 4 0.07411 0.018527 1.892 0.1408
## reads 1 0.00188 0.001877 0.192 0.6650
## DNAconc 1 0.03005 0.030048 3.069 0.0912 .
## Residuals 27 0.26438 0.009792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How stable are the communities over time?
> Communities are stable over time, and we see that the start and T4
are very similar. Moreover, the p-value are not significant (p = 0.1834,
0.0653, 0.1408) between our time variable and our community
composition.
# Question 3
pl.alpha <- plot_richness(ps.rawfilt.q3, color = "fruit", x="fruit", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)
# Let's run a simple ANOVA on richness (the other metrics are non-linear)
ps.alpha <- estimate_richness(ps.rawfilt.q3, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt.q3), ps.alpha)
ps.dataA$reads <- sample_sums(ps.q3)
ps.alpha.aov <- aov(Observed ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## fruit 1 0.01 0.01 0.002 0.968
## reads 1 7.97 7.97 1.393 0.249
## DNAconc 1 14.67 14.67 2.564 0.121
## Residuals 26 148.72 5.72
ps.alpha.aov <- aov(Shannon ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## fruit 1 0.0237 0.02366 0.262 0.613
## reads 1 0.1455 0.14548 1.610 0.216
## DNAconc 1 0.0522 0.05220 0.578 0.454
## Residuals 26 2.3500 0.09039
ps.alpha.aov <- aov(Simpson ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## fruit 1 0.00356 0.003563 0.326 0.573
## reads 1 0.01298 0.012979 1.189 0.286
## DNAconc 1 0.00563 0.005632 0.516 0.479
## Residuals 26 0.28390 0.010919
Do they change in response to the environment?
> We can see that the fruit do not influence our diversity, and the
p-values for “Observed”, “Shannon” and “Simpson” are not significant (p
= 0.968, 0.613, 0.573).
pl.alpha <- plot_richness(ps.rawfilt, color = "treatment", x="treatment", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)
Figure 15. Three different measures of alpha diversity, colors represent the origin of the inoculum (first plot), time (second plot), fruit (third plot) and treatment (fourth plot).
# Let's run a simple ANOVA on richness (the other metrics are non-linear)
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ treatment + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 4 31.02 7.75 1.681 0.1834
## reads 1 33.91 33.91 7.351 0.0115 *
## DNAconc 1 5.48 5.48 1.187 0.2855
## Residuals 27 124.54 4.61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Now let's do the same thing at the genus level, displaying only the 20 most abundant genera
# select top 10 most prevalent Families over the whole dataset
ps.gen <- tax_glom(ps.filt.q3, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)
# Plot each treatment group next to each other
ps.gen.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
facet_grid(~ inoculum, scales = "free_x", space = "free_x") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the Genus level between Philipp's and Vincent's kefir cultures")
ps.gen <- tax_glom(ps.filt, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)
ps.gen.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
facet_grid(~ time, scales = "free_x", space = "free_x") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the Genus level between start and all the time points")
ps.gen.q3 <- tax_glom(ps.filt.q3, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen.q3), TRUE)[1:10])
ps.gen.q3 <- prune_taxa(Top10_gen, ps.gen.q3)
# Melt them for plotting
ps.gen.q3.df <- psmelt(ps.gen.q3)
ps.gen.q3.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
facet_grid(~ fruit, scales = "free_x", space = "free_x") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the Genus level between figs and raisins")
Figure 16. Relative abundance for every sample, at the genus level, grouped by the origin of the culture (first plot), the different time points (second plot) and the food (third plot).
Do you see any pattern between comparison groups?
> No, we do not see any pattern, all abundance levels are similar.
The different characteristics do not seem to have an influence: Philip
vs Vincent, timepoints or food. In all cases, we find the same genus,
Lactobacillus is largely dominant and we also find some Acetobacter,
Gluconobacter and Staphylococcus.
How much influence do you think the starting culture had on the
rest of the samples?
> All our samples come from the same starting culture, so it seems to
be the most influential factor. However, we did not try with any other
starter cultures so we can not do a comparison. As the communities
appear not to be affected by the environment they were cultivated in and
remain similar to the original culture, it is highly possible that the
starter culture is the most influential factor.
# Plot the proportions to better compare the communities
pl.com.pro <- plot_bar(ps.filt, x="id_sample",fill="Linnaeus") +
facet_wrap(~treatment, scales="free_x", nrow=1) +
theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="bottom")+
guides(color=guide_legend(ncol=6, byrow=FALSE))
pl.com.pro
Figure 17. Relative abundance at the species level, grouped by treatment (origin of culture and food).
# Ordinate using Principal Coordinate Analysis & NMDS
ps.nmds.bray <- ordinate(ps.rawfilt, "NMDS", "bray",trymax=50)
# Check stress level and dimensions used
ps.nmds.bray
##
## Call:
## metaMDS(comm = veganifyOTU(physeq), distance = distance, trymax = 50)
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(veganifyOTU(physeq)))
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1310455
## Stress type 1, weak ties
## No convergent solutions - best solution after 50 tries
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(veganifyOTU(physeq)))'
# Plot NMDS
pl.nmds.bray <- plot_ordination(ps.rawfilt, ps.nmds.bray, color="treatment",title="NDMS of Bray-Curtis dissimilarities",label="id_sample" ) +
stat_ellipse(aes(group = inoculum))
pl.nmds.bray
Figure 18. Non-metric multidimensional scaling (NMDS) of all the samples based on Bray-Curtis dissimilarities, colored by treatment.
# Calculate Bray-Curtis distance
ps.bray <- phyloseq::distance(ps.rawfilt, "bray")
# Test with ADONIS
ps.adonis.int <- adonis2(ps.bray ~ reads + (inoculum*fruit), data = ps.dataA, perm =9999)
ps.adonis.int
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 9999
##
## adonis2(formula = ps.bray ~ reads + (inoculum * fruit), data = ps.dataA, permutations = 9999)
## Df SumOfSqs R2 F Pr(>F)
## reads 1 0.41510 0.15986 7.9239 0.0001 ***
## inoculum 2 0.62683 0.24140 5.9829 0.0001 ***
## fruit 1 0.04455 0.01716 0.8504 0.4860
## inoculum:fruit 1 0.04335 0.01669 0.8275 0.4972
## Residual 28 1.46678 0.56489
## Total 33 2.59661 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What seems to be the most influential factor explaining the
differences between bacterial kefir communities?
> The most influential factor is the inoculum, because Philipp’s and
Vincent’s starters are separated, but we find both fruits in both
clusters. The p-value for inoculum is very significant (p = 0.0001), but
the p-value for fruit is not significant (p = 0.4964).
The analysis of 16S rRNA amplicon sequencing data revealed that the most abundant ASVs were assigned to the phyla Firmicutes. Furthermore, Proteobacteria and Actinobacteriota were also present in our samples. While 16S rRNA amplicon sequencing typically does not provide identification at the species or strain level, the taxonomic tool used in our study allowed us to achieve a finer resolution approaching the species level. As anticipated, a significant proportion of ASVs were assigned to Lactobacillus species (Firmicutes). These findings align with the species previously characterized as key members of the water kefir community. Despite the fermentation steps taking place in different households and non-sterile conditions, the composition and structure of the various communities remained relatively stable over time. When analyzing the influence of the four treatments on the water kefir community, it became evident that the microbial species diversity of the samples was primarily determined by the starter culture. In the metagenomic analysis, the majority of reads were assigned to the Lactobacillaceae family, consistent with the 16S analysis and existing literature.